Abstract

We present a layered Boltzmann machine (BM) that can better exploit the advantages
of a distributed representation. It is widely believed that deep BMs (DBMs)
have far greater representational power than their shallow counterparts, restricted
Boltzmann machines (RBMs). However, this expected superiority of DBMs over RBMs
has never been validated theoretically. In this paper, we provide both theoretical
and empirical evidence that the representational power of DBMs can actually be
rather limited in taking advantage of distributed representations. We propose an
approximate measure for the representational power of a BM with regard to the
efficiency of a distributed representation. With this measure, we show the
surprising fact that DBMs can make inefficient use of distributed representations.
Based on these observations, we propose an alternative BM architecture, which we
dub soft-deep BMs (sDBMs). We show that sDBMs can exploit distributed
representations more efficiently in terms of the measure. Experiments demonstrate
that sDBMs outperform several state-of-the-art models, including DBMs, in
generative tasks on binarized MNIST and Caltech-101 silhouettes.


1 Introduction 

One aspect behind the superior performance of deep architectures is the effective use of distributed
representations. A representation is said to be distributed if it consists of mutually non-exclusive
features. Distributed representations can efficiently model complex functions with an enormous
number of variations by dividing the input space into a huge number of sub-regions with combinations
of features.

Recent analyses have proven the efficient use of distributed representations in deep feedforward
networks with rectified linear (ReL) activations. Such deep networks model complex input-output
relationships by dividing the input space into an enormous number of sub-regions, which grows
exponentially in the number of parameters. Multiple levels of feature representations in deep
feedforward networks facilitate efficient reuse of low-level representations, and deep feedforward
networks can thus manage an exponentially greater number of sub-regions than shallow architectures.

It is interesting to ask whether deep generative models can attain the same property as deep
discriminative models. To answer this question, it is useful to compare restricted Boltzmann
machines (RBMs) and deep Boltzmann machines (DBMs). RBMs are shallow generative models
with distributed representations. DBMs are a deep extension of RBMs [6]. DBMs are commonly
expected to have far greater representational power than RBMs while being relatively easy to
train compared to RBMs. However, this expected superiority of DBMs over RBMs has never been
validated theoretically.

In this paper, we provide both theoretical and empirical evidence that the representational power
of DBMs can actually be rather limited in exploiting the advantages of distributed representations.

Figure 1: Illustration of an sDBM. All the layer pairs are connected with different magnitudes of
strength.

Figure 2: (a) a two-layered DBM and (b) an sDBM (i.e., gBM(2)). Free energy bounds are indicated
with shaded regions. All the layers of the BMs have only one unit. The sDBM parameters are
generated with Algorithm 1 and rescaling. (Best viewed in color.)


Our contributions are as follows. First, we propose an approximate measure for the efficiency of
distributed representations of BMs, inspired by recent analyses of deep feedforward networks.
Our measure is the number of linear regions of a piecewise linear function that approximates the free
energy function of a BM. This measure approximates the number of sub-regions that a BM manages
in the visible space. We show that depth does not largely improve the representational power of a
DBM in terms of this measure. This indicates the surprising fact that DBMs can make inefficient use
of distributed representations, despite common expectations. Second, we propose a superset of DBMs,
which we dub soft-deep BMs (sDBMs). An sDBM is a layered BM where all the layer pairs are
connected with topologically defined regularization. Such relaxed connections realize a soft hierarchy,
as opposed to the hard hierarchy of conventional deep networks where only neighboring layers are
connected. We show that the number of linear regions of the approximate free energy of an sDBM
scales exponentially in the number of its layers; it can thus be as large as that of a general BM and
exponentially greater than that of an RBM or a DBM. Finally, we experimentally demonstrate the
high generative performance of sDBMs. sDBMs trained without pretraining outperform
state-of-the-art generative models, including DBMs, on two benchmark datasets: MNIST and
Caltech-101 silhouettes.


2 Soft-Deep BMs 

We propose a soft-deep BM (sDBM): a Boltzmann machine (BM) that consists of multiple
layers where all the layer pairs are connected and connections within layers are restricted. Figure 1
illustrates an sDBM. The energy of an sDBM is defined as:


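$$E(X; \theta) = -\sum_{0 \le l < k \le L} \sum_{i=1}^{N_k} \sum_{j=1}^{N_l} x_i^{(k)} W_{ij}^{(k,l)} x_j^{(l)} - \sum_{k=0}^{L} \sum_{i=1}^{N_k} b_i^{(k)} x_i^{(k)}, \qquad (1)$$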


where $\theta = \{W^{(k,l)}, b^{(k)}\}$ is the set of parameters, $\mathbf{x}^{(k)}$ is the state of the $k$-th layer,
and $X = \{\mathbf{v}, \mathcal{H}\}$ is the set of all the units, with $\mathcal{H} = \{\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(L)}\}$ being the set of hidden
layers and $\mathbf{v}$ being the visible layer. We number layers such that $\mathbf{x}^{(0)} = \mathbf{v}$ is the visible layer and
$\mathbf{x}^{(k)} = \mathbf{h}^{(k)}$ is the $k$-th hidden layer. Let $L$ be the number of hidden layers, $N$ be the total number of
units, $N_k$ be the number of units in the $k$-th layer, $N_{vis}$ be the number of visible units, and $N_{hid}$ be the
number of hidden units. An sDBM assigns probability $p(X; \theta) \propto \exp(-E(X; \theta))$ to a configuration
$X \in \{0, 1\}^N$.

RBMs and DBMs are special cases of sDBMs; RBMs are sDBMs with $L = 1$, and DBMs are sDBMs where
$W^{(k,l)} = 0$ for $k - l > 1$.
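
The energy in Eq. (1) can be evaluated directly from its definition. The following is a minimal sketch, assuming NumPy and a hypothetical container layout that is not part of the paper: `x` is a list of binary layer states with `x[0]` the visible layer, `W` is a dictionary keyed by layer pairs `(k, l)` with `l < k`, and `b` is a list of bias vectors. Leaving a pair out of `W` is treated as a zero weight matrix, so a DBM is obtained by keeping only the pairs with `k - l == 1`.

```python
import numpy as np

def sdbm_energy(x, W, b):
    """Energy of Eq. (1) for one configuration.

    x: list of 0/1 vectors; x[0] is the visible layer, x[k] the k-th hidden layer.
    W: dict mapping (k, l), 0 <= l < k <= L, to an array of shape (N_k, N_l).
       Missing layer pairs are treated as zero weights (hypothetical convention).
    b: list of bias vectors; b[k] has shape (N_k,).
    """
    # Bias term: sum over layers of b^(k) . x^(k)
    energy = -sum(b_k @ x_k for b_k, x_k in zip(b, x))
    # Pairwise term: sum over all connected layer pairs (k, l) with l < k
    for (k, l), W_kl in W.items():
        energy -= x[k] @ W_kl @ x[l]
    return float(energy)

# Toy network with 2 visible units and two hidden layers of 1 unit each.
x = [np.array([1.0, 0.0]), np.array([1.0]), np.array([1.0])]
b = [np.array([0.5, 0.5]), np.array([1.0]), np.array([-1.0])]
W_sdbm = {
    (1, 0): np.array([[1.0, 2.0]]),
    (2, 1): np.array([[3.0]]),
    (2, 0): np.array([[4.0, 5.0]]),  # bypassing connection, absent in a DBM
}
W_dbm = {pair: w for pair, w in W_sdbm.items() if pair[0] - pair[1] == 1}

e_sdbm = sdbm_energy(x, W_sdbm, b)
e_dbm = sdbm_energy(x, W_dbm, b)
```

The only difference between the two calls is the bypassing pair `(2, 0)`, which is the structural change that distinguishes an sDBM from a DBM.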
3 Quantifying the Efficiency of Distributed Representation in BMs


In this section, we define an approximate measure for the representational power of a BM based
on its free energy function. We compare various BMs in terms of this measure and show that sDBMs
can attain richer representations than DBMs and RBMs.

The free energy of a BM is defined as the negative log probability that the network assigns to a
visible configuration, without the normalizing constant:

$$F(\mathbf{v}; \theta) \triangleq -\log \sum_{\mathcal{H}} \exp(-E(\mathbf{v}, \mathcal{H}; \theta)), \qquad (2)$$

where $\sum_{\mathcal{H}}$ denotes summation over all the hidden configurations, and $\theta$ is the set of parameters. We
can measure the representational power of a BM by the complexity of its free energy
function because the free energy function contains all the information on the probability distribution
that the BM models.

3.1 Hard-min Approximation of Free Energy 

We here define a piecewise linear approximation of the free energy function of a BM. For RBMs,
it is widely known that the free energy function can be well approximated with a piecewise linear
function. This idea can be extended to general BMs that do not have connections between
visible units, which include sDBMs, as follows: the operation $\log \sum \exp$ in Eq. (2) can be
regarded as a relaxed max operation; the sum is virtually dominated by the smallest energy, i.e.,
$\sum_{\mathcal{H}} \exp(-E(\mathbf{v}, \mathcal{H})) \approx \exp(-\min_{\mathcal{H}} E(\mathbf{v}, \mathcal{H}))$, where $\min_{\mathcal{H}}$ denotes the min operation over all
possible hidden configurations. The negative logarithm of the sum is thus nearly $\min_{\mathcal{H}} E(\mathbf{v}, \mathcal{H})$. Based
on this observation, we define the following approximation of the free energy:

Definition 1. The hard-min free energy $\hat{F}$ of a BM with parameters $\theta$ is defined as:

$$\hat{F}(\mathbf{v}; \theta) = \min_{\mathcal{H}} E(\mathbf{v}, \mathcal{H}; \theta). \qquad (3)$$

Note that $\hat{F}(\mathbf{v}; \theta)$ is a piecewise linear function if the BM does not have connections between visible
units because $E(\mathbf{v}, \mathcal{H}; \theta)$ then does not have interactions involving multiple visible units.

Formally, we can show that $\hat{F}$ bounds $F$ as follows:

Theorem 2. Let $E_{res}(\mathbf{v}) = -\log\{\sum_{\mathcal{H}} \exp(-E(\mathbf{v}, \mathcal{H})) - \exp(-\hat{F}(\mathbf{v}))\}$. Then the free energy
$F(\mathbf{v})$ is bounded as:

$$\hat{F}(\mathbf{v}) - \exp(\hat{F}(\mathbf{v}) - E_{res}(\mathbf{v})) \le F(\mathbf{v}) \le F_{mf}(\mathbf{v}) \le \hat{F}(\mathbf{v}), \qquad (4)$$

where $F_{mf}$ is the mean-field approximation of the free energy.

The tightness of the bound is determined by the dominance of the minimum energy $\hat{F}$ over the free
energy. The gap between the upper and the lower bounds becomes small if $\hat{F}$ is well below
$E_{res}$, the contribution of the non-minimum energies to the free energy.

Theorem 2 shows that $\hat{F}$ is a very rough approximation of the free energy; $\hat{F}$ is less accurate than the
mean-field approximation $F_{mf}$. Nevertheless, the bound can be tight except at points where several
energy terms nearly achieve the minimum, e.g., boundaries between linear regions of $\hat{F}$. Figure 2
demonstrates this idea. Therefore, we can roughly measure the complexity of the free
energy of a BM by quantifying the complexity of $\hat{F}$.
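
For small models, the outer inequalities of Theorem 2 can be checked by enumerating all hidden configurations. The sketch below assumes NumPy and a toy RBM with randomly drawn parameters (the sizes, the seed, and all function names are illustrative choices, not the paper's). It computes $F$, $\hat{F}$ and $E_{res}$ by brute force; the mean-field term $F_{mf}$ is omitted.

```python
import itertools
import numpy as np

def hidden_energies(energy_fn, v, n_hidden):
    """Energies E(v, H) for all 2**n_hidden hidden configurations H."""
    return np.array([energy_fn(v, np.array(h, dtype=float))
                     for h in itertools.product([0, 1], repeat=n_hidden)])

def neg_log_sum_exp(energies):
    """Numerically stable evaluation of -log(sum(exp(-energies)))."""
    m = energies.min()
    return m - np.log(np.sum(np.exp(-(energies - m))))

def free_energy_bounds(energy_fn, v, n_hidden):
    """Return (lower bound, F, F_hat) of Theorem 2 for a visible vector v."""
    e = hidden_energies(energy_fn, v, n_hidden)
    f = neg_log_sum_exp(e)        # exact free energy, Eq. (2)
    f_hat = e.min()               # hard-min free energy, Eq. (3)
    # E_res: the same sum with one minimizing term removed
    e_res = neg_log_sum_exp(np.delete(e, e.argmin()))
    lower = f_hat - np.exp(f_hat - e_res)
    return lower, f, f_hat

# Toy RBM with 2 visible and 3 hidden units (illustrative parameters).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b_vis = rng.normal(size=2)
b_hid = rng.normal(size=3)

def rbm_energy(v, h):
    return -(h @ W @ v) - b_vis @ v - b_hid @ h

results = [free_energy_bounds(rbm_energy, np.array(v, dtype=float), 3)
           for v in itertools.product([0, 1], repeat=2)]
```

Each entry of `results` is a triple `(lower, F, F_hat)`; the gap `F_hat - lower` shrinks as the minimum energy becomes more dominant, which is the tightness condition stated above.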

A natural way to quantify the complexity of a piecewise linear function is to count the number of
its linear regions. This strategy was recently applied to the piecewise linear input-output function
of a deep feedforward network with ReL activations to quantify its representational power.
Inspired by these analyses, we propose to use the number of linear regions of $\hat{F}$ to measure a
BM's representational power. Intuitively, this measure roughly indicates the number of effective
Bernoulli mixing components of a BM; $\hat{F}$ with $k$ linear regions will be well approximated by the
negative log-probability function of a mixture of $k$ Bernoulli components by assigning one component
to each region. We therefore call this measure the number of effective mixtures of a BM:

Definition 3. Suppose a BM with no connections between visible units. The number of effective
mixtures of the BM is the number of linear regions of the hard-min free energy $\hat{F}$ of the BM.

It follows from Definitions 1 and 3 that the maximal number of effective mixtures of a BM is bounded
above by the number of its hidden configurations:

Proposition 4. The number of effective mixtures of a BM is upper bounded by $2^{N_{hid}}$.

Note that this proposition tells us nothing about whether this bound is actually achievable by a BM
with a certain parameter configuration; we provide positive results in later sections.

The number of effective mixtures of a BM approximately measures the efficiency of the distributed
representation. Each configuration of a distributed representation can give rise to a linear region
of $\hat{F}$. Therefore, an efficient distributed representation of a BM potentially manages $2^{N_{hid}}$
sub-regions in the visible space. The efficiency, however, can be substantially damaged by restricted
connections.

For deep feedforward networks with ReL activations, Montufar et al. [4] showed that a deeper network
can model a piecewise linear function with many more linear regions than a shallow network with the
same number of parameters. The number of linear regions grows exponentially in the number
of layers. Now we ask a question: is this also true for DBMs in terms of the approximate free
energy $\hat{F}$? Surprisingly, the answer is no. We provide proofs in the following sections.

3.2 The Number of Effective Mixtures of an RBM 

We first analyze RBMs. The free energy function of an RBM can be approximated with a 2-layered
feedforward network with ReL activations. The number of linear regions of the input-output function
of such a shallow network has been studied by Pascanu et al. and Montufar et al. [4]. With a slight
modification of their results, we can compute the maximal number of effective mixtures of an RBM:

Theorem 5. The maximal number of effective mixtures of an RBM is $\sum_{j=0}^{N_0} \binom{N_1}{j}$.

Note that this bound is considerably smaller than the upper bound in Proposition 4 for $N_1 > N_0$ because
$\sum_{j=0}^{N_0} \binom{N_1}{j} \ll 2^{N_1}$.
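
To get a feel for the gap between Theorem 5 and Proposition 4, the two counts can be compared numerically. This is a small sketch using only the Python standard library; the function name is an illustrative choice.

```python
from math import comb

def rbm_max_effective_mixtures(n_visible, n_hidden):
    """Theorem 5: sum_{j=0}^{N_0} C(N_1, j) for an RBM with N_0 visible
    and N_1 hidden units. math.comb returns 0 when j > N_1."""
    return sum(comb(n_hidden, j) for j in range(n_visible + 1))

# With few visible units the count grows only polynomially in N_1 ...
few_visible = rbm_max_effective_mixtures(n_visible=2, n_hidden=20)
# ... whereas the bound of Proposition 4 is 2**N_1.
proposition_4 = 2 ** 20
# When N_0 >= N_1 the two counts coincide.
many_visible = rbm_max_effective_mixtures(n_visible=784, n_hidden=500)
```

For 2 visible and 20 hidden units the RBM count is 211, against 1,048,576 hidden configurations; for 784 visible and 500 hidden units the sum equals $2^{500}$, so the restriction only bites when $N_1 > N_0$.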

3.3 The Number of Effective Mixtures of a DBM 

Next we analyze DBMs. Here we provide lower and upper bounds on the maximal number of
effective mixtures of DBMs. We have a lower bound because DBMs are a superset of RBMs:

Proposition 6. The maximal number of effective mixtures of a DBM is lower bounded by
$\sum_{j=0}^{N_0} \binom{N_1}{j}$.

A key idea of the proof of an upper bound, which we show in the appendix, is that energies
associated with the same configuration of the first hidden layer $\mathbf{h}^{(1)}$ have an identical gradient in the
space of $\mathbf{v}$. For example, $E(v, h^{(1)} = 0, h^{(2)} = 0)$ and $E(v, h^{(1)} = 0, h^{(2)} = 1)$ have the
same gradient, i.e., slope, in Fig. 2(a). This is because $\mathbf{h}^{(2)}$ does not affect the statistics of $\mathbf{v}$
given $\mathbf{h}^{(1)}$. The number of linear regions of $\hat{F}$ is therefore bounded by $2^{N_1} = 2^1$ because one
of the energy terms with the same slope is globally smaller than the other, e.g.,
$E(v, h^{(1)} = 0, h^{(2)} = 0) < E(v, h^{(1)} = 0, h^{(2)} = 1)$ for any $v$. This generalizes to any DBM,
leading to a natural but somewhat shocking result where the bound depends only on the number of
units in the first hidden layer:

Theorem 7. The number of effective mixtures of a DBM with any number of hidden layers is upper
bounded by $2^{N_1}$.

Depth does not largely help the number of effective mixtures of DBMs. This suggests that a
distributed representation is used inefficiently in a DBM, at least within the scope of the approximate
free energy $\hat{F}$. From Proposition 4 and Theorem 7, we can readily show a serious limitation on the
number of effective mixtures of DBMs:

Proposition 8. The number of effective mixtures of a DBM with $N_1 > N_0$ never achieves the bound
$2^{N_{hid}}$.
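
The key idea behind Theorem 7 can be illustrated numerically for a two-hidden-layer DBM: since the gradient of the energy with respect to $\mathbf{v}$ depends on $\mathbf{h}^{(1)}$ only, minimizing over $\mathbf{h}^{(2)}$ merely selects the smallest offset for each first-layer configuration, leaving at most $2^{N_1}$ affine pieces. The sketch below assumes NumPy, random parameters, and hypothetical function names; it checks that the brute-force hard-min free energy coincides with the minimum over these $2^{N_1}$ collapsed pieces.

```python
import itertools
import numpy as np

def binary_states(n):
    """All 2**n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def dbm_energy(v, h1, h2, W1, W2, b0, b1, b2):
    """Energy of a DBM with two hidden layers (only adjacent layers connected)."""
    return -(h1 @ W1 @ v) - (h2 @ W2 @ h1) - b0 @ v - b1 @ h1 - b2 @ h2

def hard_min_brute_force(v, params, n1, n2):
    """Minimum energy over all 2**(n1 + n2) hidden configurations."""
    return min(dbm_energy(v, h1, h2, *params)
               for h1 in binary_states(n1) for h2 in binary_states(n2))

def hard_min_collapsed(v, params, n1, n2):
    """Minimum over only 2**n1 affine functions of v, one per h1."""
    W1, W2, b0, b1, b2 = params
    best = np.inf
    for h1 in binary_states(n1):
        slope = -(W1.T @ h1 + b0)  # gradient in v: depends on h1 only
        offset = min(-(h2 @ W2 @ h1) - b1 @ h1 - b2 @ h2
                     for h2 in binary_states(n2))  # best h2 for this h1
        best = min(best, slope @ v + offset)
    return best

rng = np.random.default_rng(1)
n0, n1, n2 = 2, 2, 3
params = (rng.normal(size=(n1, n0)), rng.normal(size=(n2, n1)),
          rng.normal(size=n0), rng.normal(size=n1), rng.normal(size=n2))
points = rng.normal(size=(20, n0))
pairs = [(hard_min_brute_force(v, params, n1, n2),
          hard_min_collapsed(v, params, n1, n2)) for v in points]
```

Here the model has $2^{N_{hid}} = 32$ hidden configurations, yet `hard_min_collapsed` reproduces $\hat{F}$ from only $2^{N_1} = 4$ affine functions, regardless of how the second hidden layer is parameterized.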
Algorithm 1 A recursive construction of gBM(L)

function softDeep(L)
  if L = 0 then
    w^{(k,l)} ← 0 for 0 ≤ l < k < ∞
    b^{(k)} ← 0 for 0 ≤ k < ∞
    return {w^{(k,l)}}, {b^{(k)}}
  else
    {w^{(k,l)}}, {b^{(k)}} ← softDeep(L − 1)
    x ← 2^{L−1}
    w^{(L,0)} ← x
    b^{(L)} ← 0.5x(1 − x)
    for l = 1 to L − 1 do
      w^{(L,l)} ← …
    end for
    return {w^{(k,l)}}, {b^{(k)}}
  end if
end function



Figure 3: (a) gBM(L) for $L \in \{1, 2, 3\}$ with unit indices and connection strengths. (b) to (d):
$\hat{F}$ (printed in black) and $E(v, \mathbf{u}) \in S(L)$ (in gray) with L = (b) 1, (c) 2, and (d) 3.


3.4 The Number of Effective Mixtures of an sDBM 

The key to the limited number of effective mixtures of DBMs is the independence between $\mathbf{v}$ and
$\{\mathbf{h}^{(2)}, \mathbf{h}^{(3)}, \ldots\}$ given $\mathbf{h}^{(1)}$. Conversely, if there exists a dependency between the visible and the
upper hidden layers even given $\mathbf{h}^{(1)}$, the limitation on the number of effective mixtures will not
hold. Bypassing connections of sDBMs might therefore improve the number of effective mixtures.
Figure 2(b) demonstrates this idea by showing that $\hat{F}$ of an sDBM attains $2^{N_{hid}} = 2^2$ linear regions
with properly chosen parameters. In this section, we refine this idea for general sDBMs.


3.4.1 General BMs as Elemental sDBMs 


We first analyze the number of effective mixtures of a general BM with only one visible unit, which
can be regarded as an elemental sDBM. Let gBM(L) be a general BM with L hidden units and one
visible unit whose energy function is defined as:

$$E(v, \mathbf{u}; \theta) = -\sum_{0 \le l < k \le L} x^{(k)} w^{(k,l)} x^{(l)} - \sum_{k=0}^{L} b^{(k)} x^{(k)}, \qquad (5)$$

where we defined $v = x^{(0)}$, $\mathbf{u} = (x^{(1)}, \ldots, x^{(L)})$, and $\theta = \{w^{(k,l)}, b^{(k)}\}$ for $0 \le l < k \le L$. Because
we regard a gBM(L) as an elemental sDBM with L layers, each of which has only one unit, we index
units and parameters with superscripts. For the same reason, we may call the $k$-th unit of a gBM(L)
the $k$-th layer. Let $S(L)$ be the set of one-dimensional linear functions defined over the visible
unit: $S(L) \triangleq \{E(v, \mathbf{u}) \mid \mathbf{u} \in \{0, 1\}^L\}$.


We first analyze an arrangement of the elements of $S(L)$ with a network construction procedure and
then analyze the number of linear regions of $\hat{F}$ under this arrangement. The procedure is listed in
Algorithm 1, where a network is constructed by appending a unit in a recursive manner, starting
from the 1st unit to the $L$-th unit (see Fig. 3(a) for an example). With this construction, we can show
that all the elements of $S(L)$ are arranged to be tangents of a quadratic curve at $2^L$ different points:

Lemma 9. Assume that $\{w^{(k,l)}\}$ and $\{b^{(k)}\}$ are computed with SOFTDEEP(M) for a large integer M.
Then the elements of $S(L)$ for $0 < L \le M$ are tangents of a quadratic function with equally spaced
points of tangency.


From Lemma 9, we can readily show that:

Lemma 10. The number of effective mixtures of a gBM(L) reaches $2^L = 2^{N_{hid}}$, the bound in
Proposition 4, when parameters are properly set.

Figure 4: Heatmap of $F$ of a two-layered sDBM. The sDBM is constructed as a bundle of two
independent gBM(2)s. Lines indicate the boundaries of the linear regions of $\hat{F}$. The parameters
are computed with Algorithm 1 and with rescaling.


Figures 3(b) to 3(d) demonstrate the statements of Lemmas 9
and 10 for $L \in \{1, 2, 3\}$. Different hidden units control the slope
of the energy at different levels of magnitude (e.g., 1, 2, 4, 8, ...).
This allows the energy terms $E(v, \cdot)$ to have mutually different slopes and thus
leads to $2^L$ effective mixtures.
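
A construction with the properties stated in Lemmas 9 and 10 can be checked numerically. The sketch below assumes NumPy and follows the visible-to-hidden weights $w^{(k,0)} = 2^{k-1}$ and biases $b^{(k)} = 0.5x(1-x)$ with $x = 2^{k-1}$. The hidden-to-hidden weights $w^{(k,l)} = -2^{k-1} 2^{l-1}$ are an assumption made here: they are one choice that turns every energy term into the line $E_n(v) = -nv + n(n-1)/2$, where $n$ is the integer whose binary digits are the hidden states, and are not necessarily the values of Algorithm 1. These lines are tangents of a quadratic curve, and consecutive lines intersect at the equally spaced points $v = 0, 1, \ldots, 2^L - 2$. The visible unit is treated as a real variable and no rescaling is applied.

```python
import itertools
import numpy as np

def soft_deep(L):
    """Weights w[(k, l)], 0 <= l < k <= L, and biases b[k] of a gBM(L)."""
    w, b = {}, {0: 0.0}
    for k in range(1, L + 1):
        x = 2.0 ** (k - 1)
        w[(k, 0)] = x                        # visible-to-hidden weight
        b[k] = 0.5 * x * (1.0 - x)           # hidden bias
        for l in range(1, k):
            w[(k, l)] = -x * 2.0 ** (l - 1)  # ASSUMED hidden-to-hidden weight
    return w, b

def energy_lines(L):
    """(slope, intercept) of E(v, u) as a function of v, for every u."""
    w, b = soft_deep(L)
    lines = []
    for u in itertools.product([0, 1], repeat=L):
        x = (None,) + u                      # x[k] is the state of hidden unit k
        slope = -(b[0] + sum(w[(k, 0)] * x[k] for k in range(1, L + 1)))
        intercept = -sum(b[k] * x[k] for k in range(1, L + 1))
        intercept -= sum(w[(k, l)] * x[k] * x[l]
                         for k in range(1, L + 1) for l in range(1, k))
        lines.append((slope, intercept))
    return lines

def count_linear_regions(L, num=20001):
    """Count the distinct minimizing lines of min_u E(v, u) on a grid of v."""
    slopes, intercepts = np.array(energy_lines(L)).T
    v = np.linspace(-1.0, 2.0 ** L, num)
    winners = np.argmin(np.outer(v, slopes) + intercepts, axis=1)
    return len(np.unique(winners))

regions = [count_linear_regions(L) for L in (1, 2, 3)]
```

Under this parameter choice every hidden configuration wins on its own interval, so the hard-min free energy has $2^L$ linear regions, in line with Lemma 10.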

We call connections determined with Algorithm 1 soft-deep connections
because the connection strengths can be regarded as decaying
exponentially in the distance between layers. Let the units be
aligned in the sequential order k = 0, L, ..., 1 as in Fig. 3(a).
Under this spatial configuration, the strength of a connection from a
unit to an upper unit that is d units away is proportional to $2^{-d}$.
We observe that this connection pattern is a soft counterpart of the
conventional deep connection pattern where only adjacent layers are
connected.

3.4.2 Main Results 

By applying Lemma 10 to an sDBM constructed as a bundle of
independent gBM(L)s, we can show that the maximal number of
effective mixtures of an L-layered sDBM scales exponentially in L:


Theorem 11. Suppose an sDBM with L hidden layers, each of
which contains $M$ ($\le N_{vis}$) units. Then the number of effective mixtures of this sDBM reaches
$2^{ML} = 2^{N_{hid}}$, the bound in Proposition 4, with a certain parameter configuration.


Figure 4 demonstrates the claim of Theorem 11: the free energy function of an sDBM with four
hidden units can be well approximated with $\hat{F}$, which has $2^4 = 16$ linear regions.


Together with the analysis of DBMs and RBMs, Theorem 11 indicates that soft-deep connections that
bypass between remote layers can be vital for a deeply layered BM to have representational power
superior to that of a shallow one in terms of the efficiency of a distributed representation. This
clearly contrasts with feedforward networks, where bypassing connections do not critically affect the
representational power [9].


4 Remarks 

sDBMs have two appealing properties other than the huge number of effective mixtures.
First, fast block Gibbs sampling can be performed. Although sampling efficiency degrades
compared to DBMs due to the dependency between hidden layers introduced by soft-deep connections,
we believe that the benefits of the huge representational power offset this negative effect.

Second, soft-deep connections can ease difficulties in learning deeply layered BMs. Because DBMs
do not have connections between remote layers, the effect that the visible layer exerts on remote
hidden layers decays exponentially in the depth. This phenomenon hinders learning signals
from correctly propagating through deep layers. We believe that one of the benefits of pretraining
is to mitigate this stochastic vanishing gradient effect. The soft-deep connections ease this problem
by bypassing between the visible layer and remote hidden layers. We believe that the high generative
performance of sDBMs without pretraining shown in Section 5 is achieved not only through the huge
representational power proven in Theorem 11 but also through the less severe vanishing gradient effect.

4.1 Soft-Deep Regularization 

For the number of effective mixtures of an sDBM to scale exponentially in the depth as in
Theorem 11, it is essential that connection weights span multiple levels of magnitude. Without
regularization or with uniform regularization, networks do not attain such a property via learning.
To address this point, we introduce soft-deep regularization, where the strength of L2 regularization
for connections between the $k$-th and $l$-th layers is inversely proportional to a function of the
weights $w^{(k,l)}$ computed with Algorithm 1, controlled by a hyperparameter $p$. Although this technique
does not strictly guarantee that the weights scale as prescribed by Algorithm 1, we experimentally
observed that this regularization improves the performance of sDBMs.


5 Experiments 


We have discussed the representational power of BMs based on the approximate free energy $\hat{F}$. To
validate the approximation, we experimentally demonstrate the advantages of sDBMs. We performed
experiments on two datasets: MNIST digits [10] and Caltech-101 silhouettes [11]. We used Theano
and pylearn2 to implement sDBMs. We used stochastic maximum likelihood to jointly
train networks with the centering method and soft-deep regularization. We did not perform
pretraining. We scheduled learning rates to decay linearly from an initial value to zero.

To evaluate networks, we used AIS to estimate the variational lower bound on the average
log-likelihood on test data. We evaluated the reliability of the estimates by computing 3σ confidence
intervals, which we show in the supplementary material.

On both datasets, we trained 2-, 3-, and 4-layered sDBMs with various hyperparameters. The
number of units in each hidden layer is fixed to 500. Hyperparameters are tuned via random sampling.
See the supplementary material for details.



Figure 5: Random samples from a 4-layered sDBM (left in each cell) displayed with nearest training 
(center) and test samples (right) from binarized MNIST. Generated images are probabilities that 
pixels are sampled from. The nearest neighbors are computed in terms of pixelwise distance. 
The sDBM does not simply memorize training examples but generalizes to unseen test examples.


5.1 Binarized MNIST 

MNIST is a collection of gray-scale digit images that consists of 60,000 training samples and
10,000 test samples [10]. We binarized the images following the procedure of Salakhutdinov and
Murray to generate training and test data.

Table 1 compares sDBMs and various models in the literature in terms of generative performance.
sDBMs performed considerably better than the other models. Even the 2-layered sDBM
outperformed the previous state-of-the-art test log-likelihood of -80.97 nat from a recent report
(DRAW in Table 1). The best-performing 4-layered sDBM achieved -66.56 nat of test log-likelihood
with a 3σ confidence interval of [-67.01, -65.70] nat.

Note that sDBMs with 2 and 3 hidden layers outperformed DBMs with the same number of layers
[20]. This result can be seen as reflecting the exponentially greater number of effective mixtures of
sDBMs compared with DBMs with the same number of parameters.

The depth of networks largely improved the performance, though the improvement of the 3-layered
model over the 2-layered model was relatively small. We believe that this effect is due to insufficient
parameter tuning: the 3-layered model performed worse than the 2-layered model on training data,
as shown in the supplementary material. With more precise parameter tuning, the performance of
models would improve uniformly with the depth of the networks.

Figure 5 shows random samples from the best performing sDBM. The sDBM generalizes well to
unseen test examples.

Figure 6: Random samples from a 4-layered sDBM trained on Caltech-101 silhouettes displayed
with nearest training and test examples as in Fig. 5.


Table 1: Comparison of generative performance of various generative models on binarized MNIST.
We report average test log-likelihood measured in nat.

Model                      Test LL      ≥
RBM [17]                   ≈ -86.34     -
DBN 2hl [22]               ≈ -84.55     -
DBM 2hl [20]               ≈ -83.43     -
DBM 3hl [20]               ≈ -83.02     -
NADE [24]                  -88.33       -
EoNADE 2hl [25]            -85.10       -
EoNADE-5 2hl [26]          -84.68       -
DLGM                       ≈ -86.60     -
DLGM 8 leapfrogs [21]      ≈ -85.51     -88.30
DARN 1hl [23]              ≈ -84.13     -88.30
DARN 12hl [23]             -            -87.72
DRAW                       -            -80.97
sDBM 2hl                   -            -76.41
sDBM 3hl                   -            -74.58
sDBM 4hl                   -            -66.56


5.2 Caltech-101 silhouettes 

Caltech-101 silhouettes is a collection of binary silhouette images of various objects [11]. The
dataset contains 4,100 training samples and 2,307 test samples.

Table 2 compares sDBMs with several other models on generation of Caltech-101 silhouettes.
sDBMs outperformed the previous state of the art, NADE trained with the reweighted wake-sleep
(RWS) algorithm. The 4-layered sDBM achieved -85.55 nat of test log-likelihood with a
3σ confidence interval of [-85.67, -85.40] nat. To the best of our knowledge, this is the best
average test log-likelihood achieved on Caltech-101 silhouettes.

The depth improves the performance of sDBMs. We believe that less precise parameter tuning
resulted in the poor performance of the 2-layered model, as in the experiments with MNIST.

Figure 6 shows samples generated from the best performing 4-layered model. Most samples
demonstrate good generalization by the network, though some samples resemble training examples.

Table 2: Comparison of generative performance of various generative models on Caltech-101
silhouettes. We report average test log-likelihood as in Table 1.

Model            Test LL
RBM [29]         -109.0
RBM [23]         -107.8
NADE-2 [26]      -108.8
NADE-5 [26]      -107.8
NADE-RWS         -104.3
sDBM 2hl         ≥ -92.4
sDBM 3hl         ≥ -98.7
sDBM 4hl         ≥ -85.5


6 Conclusion 

In this paper, we proposed a BM architecture that can better exploit the distributed representation. 
We proposed a measure for the efficiency of a distributed representation of a BM, the number of 
effective mixtures of a BM, which is the number of linear regions of a piecewise linear function that 
approximates a free energy function of a BM. We showed inefficiency of DBMs with respect to the 
maximal number of effective mixtures. We proposed sDBMs, an extension of DBMs. We showed 
that the maximal number of effective mixtures of an sDBM is exponentially larger than that of an
RBM or a DBM. Finally, we experimentally demonstrated the high generative performance of sDBMs.
A Boltzmann Machines


A Boltzmann machine (BM) is a stochastic generative model, which is typically defined over N
binary units $x_i \in \{0, 1\}$. The probability that a BM assigns to a state $X = \{x_i\}$ is defined as

$$p(X; \theta) = \frac{1}{Z(\theta)} \exp(-E(X; \theta)), \qquad (6)$$

where the normalization constant, or the partition function of the BM, is denoted by $Z(\theta)$, and its
energy function is defined as

$$E(X; \theta) = -\sum_{i=1}^{N} \sum_{j=1}^{N} x_i w_{ij} x_j - \sum_{i=1}^{N} b_i x_i, \qquad (7)$$

where $w_{ij}$ are symmetric (i.e., $w_{ij} = w_{ji}$, $w_{ii} = 0$) connection weights between units i and j, $b_i$
are biases, and $\theta$ is the set of the parameters.

A BM with visible and hidden units can have rich representations; visible units $v_j \in \mathbf{v}$ correspond
to data variables, and hidden units $h_i \in \mathcal{H}$ correspond to latent features of data. Every unit is
either a visible unit or a hidden unit (i.e., $X = \{\mathcal{H}, \mathbf{v}\}$). The numbers of visible and hidden units
are denoted by $N_{vis}$ and $N_{hid}$.

Various network topologies that restrict connections of BMs have attracted great research interest
[5, 6, 30, 31]. Although general BMs are a superset of such BMs with restricted connections, general
BMs are rarely used in practice. The main problem with general BMs is the difficulty of learning due
to the intractability of the expectations with respect to the data-dependent and model distributions.
One approach is to approximate the expectations via expensive MCMC. The relaxation time of a Markov
chain can be quite long because general BMs have an enormous number of well-separated modes to
be explored. Moreover, dense connections of BMs require generic Gibbs sampling, which updates
only one unit at a time. Restrictions on connections can alleviate these issues. In particular, BMs
with layered connection patterns are widely studied because of their appealing properties such as
efficiency in sampling, less complex energy landscapes, and simplicity of learning algorithms. We
here review two representative layered BMs: restricted BMs (RBMs) and deep BMs (DBMs).

A.1 RBMs 

An RBM is a BM with a bipartite graph that consists of a visible layer and a hidden layer.
Connections within each layer are restricted. The energy function of an RBM is defined as

$$E(\mathbf{v}, \mathbf{h}^{(1)}; \theta) = -\sum_{i=1}^{N_1} \sum_{j=1}^{N_0} h_i^{(1)} W_{ij}^{(1,0)} v_j - \sum_{j=1}^{N_0} b_j^{(0)} v_j - \sum_{i=1}^{N_1} b_i^{(1)} h_i^{(1)}, \qquad (8)$$

where $\mathbf{h}^{(1)}$ denotes the states of the (first) hidden layer, and $\theta = \{W^{(1,0)}, b^{(0)}, b^{(1)}\}$ are model
parameters. We here use redundant notation with layer indices written as superscripts to avoid
confusion with the notation for models that we describe in later sections.

RBMs exhibit the nice property that the conditional distributions $p(\mathbf{h}^{(1)}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h}^{(1)})$ are tractable
and factorized. This allows us to perform fast block Gibbs sampling and makes the data-dependent
expectation tractable.

However, such tractability substantially sacrifices the representation power of RBMs.


A.2 DBMs


DBMs are an extension of RBMs that have multiple hidden layers forming a deep hierarchy.
Connections within each layer are restricted, and units in a layer are connected to all the units in the
neighboring layers [6]. The energy function of a DBM with L layers is

$$E(X; \theta) = -\sum_{k=0}^{L-1} \sum_{i=1}^{N_{k+1}} \sum_{j=1}^{N_k} x_i^{(k+1)} W_{ij}^{(k+1,k)} x_j^{(k)} - \sum_{k=0}^{L} \sum_{i=1}^{N_k} b_i^{(k)} x_i^{(k)}, \qquad (9)$$

where we number layers such that the 0th layer is the visible layer, and the $k$-th layer is the $k$-th hidden
layer. The state of the $k$-th layer is denoted by $\mathbf{x}^{(k)}$; hence the state of the $k$-th hidden layer
is $\mathbf{h}^{(k)} = \mathbf{x}^{(k)}$ for $0 < k \le L$, and the state of the visible layer is $\mathbf{v} = \mathbf{x}^{(0)}$ ($v_j = x_j^{(0)}$).
Let $N_k$ be the number of units in the $k$-th layer (i.e., $N_{vis} = N_0$, $N_{hid} = \sum_{k=1}^{L} N_k$).

DBMs have several appealing properties. Fast block Gibbs sampling is also applicable to DBMs, 
as to RBMs. Particularly, block sampling is highly efficient because conditional distributions of the 
even layers given the odd layers and those of the odd layers given the even layers are tractable and 
factorized. Moreover, DBMs possess greater representation power than RBMs because of multiple 
hidden layers. 

However, the improved representation power causes a serious difficulty. The data-dependent
expectation needs to be approximated in learning because the conditional distribution $p(\mathcal{H}|\mathbf{v})$ is no
longer tractable; a stochastic approximation procedure or variational inference [6, 33] is used for the
approximation. When DBMs were introduced, Salakhutdinov and Hinton [6] proposed a pre-training
algorithm to ease this problem. Recently, the centering method was proposed for joint training of
DBMs without pre-training.

When DBMs were introduced, they were expected to be scalable, i.e., great performance
improvements could be achieved by stacking layers as in other deep neural
models. However, experiments suggest that this is not the case; improvements are hard to obtain
with very deep BMs with more than 3 hidden layers, even with elaborate learning algorithms.
It is widely believed that the poor scalability of DBMs arises because inefficient optimization methods
prevent us from exploiting the huge representation capacity of DBMs. This is true to some
extent. We, however, provide both empirical and theoretical evidence that the poor scalability
of DBMs is due not only to optimization issues but also to the rather limited representation
capacity of DBMs.


B Proof of Theorems 

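Theorem 2. Let $E_{res}(\mathbf{v}) = -\log\{\sum_{\mathcal{H}} \exp(-E(\mathbf{v}, \mathcal{H})) - \exp(-\hat{F}(\mathbf{v}))\}$. Then the free energy
$F(\mathbf{v})$ is bounded as:

$$\hat{F}(\mathbf{v}) - \exp(\hat{F}(\mathbf{v}) - E_{res}(\mathbf{v})) \le F(\mathbf{v}) \le F_{mf}(\mathbf{v}) \le \hat{F}(\mathbf{v}), \qquad (10)$$

where $F_{mf}$ is the mean-field approximation of the free energy.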


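Proof. We first show the upper bound. For any approximating posterior $Q(\mathcal{H}|\mathbf{v})$, we have

$$F(\mathbf{v}) = \sum_{\mathcal{H}} Q(\mathcal{H}|\mathbf{v}) E(\mathcal{H}, \mathbf{v}) - \sum_{\mathcal{H}} Q(\mathcal{H}|\mathbf{v}) \log \frac{1}{Q(\mathcal{H}|\mathbf{v})} - \mathrm{KL}(Q(\cdot|\mathbf{v}) \,\|\, P(\cdot|\mathbf{v})), \qquad (11)$$

where $P(\mathcal{H}|\mathbf{v})$ denotes the model's true posterior distribution. Suppose we have the following
approximating posterior:

$$Q(\mathcal{H}|\mathbf{v}) = \begin{cases} 1 & (\mathcal{H} = \hat{\mathcal{H}}(\mathbf{v})) \\ 0 & (\text{otherwise}) \end{cases}, \qquad (12)$$

where we defined $\hat{\mathcal{H}}(\mathbf{v}) = \arg\min_{\mathcal{H}} E(\mathbf{v}, \mathcal{H})$. Note that this posterior factorizes. With this
posterior, Eq. (11) becomes

$$F(\mathbf{v}) = \hat{F}(\mathbf{v}) - \mathrm{KL}(Q(\cdot|\mathbf{v}) \,\|\, P(\cdot|\mathbf{v})). \qquad (13)$$

We have $F(\mathbf{v}) \le \hat{F}(\mathbf{v})$ because $\mathrm{KL}(Q(\cdot|\mathbf{v}) \,\|\, P(\cdot|\mathbf{v})) = -\log P(\hat{\mathcal{H}}(\mathbf{v})|\mathbf{v}) \ge 0$. The equality
holds if and only if the true posterior has all its mass on $\hat{\mathcal{H}}(\mathbf{v})$, i.e., $P(\mathcal{H}|\mathbf{v}) = Q(\mathcal{H}|\mathbf{v})$. We
have the inequality $F_{mf}(\mathbf{v}) \le \hat{F}(\mathbf{v})$ readily from the definition of the mean-field free energy because
$Q$ is factorized.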






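Next, we prove the lower bound as:

$\hat{F}(\mathbf{v}) - F(\mathbf{v}) = \hat{F}(\mathbf{v}) + \log \sum_{\mathcal{H}} \exp(-E(\{\mathcal{H}, \mathbf{v}\}))$ (14)

$= \log\left(\frac{\exp(-\hat{F}(\mathbf{v})) + \exp(-E_{\mathrm{res}}(\mathbf{v}))}{\exp(-\hat{F}(\mathbf{v}))}\right)$ (15)

$= \log\left(1 + \exp(-E_{\mathrm{res}}(\mathbf{v}) + \hat{F}(\mathbf{v}))\right)$ (16)

$\le \exp(\hat{F}(\mathbf{v}) - E_{\mathrm{res}}(\mathbf{v}))$, (17)

where we used $\log(x) \le x - 1$ in the last line. The equality holds if and only if $\hat{F}(\mathbf{v}) = E_{\mathrm{res}}(\mathbf{v})$. □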
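
The two outer bounds of Theorem 2 can be checked numerically on a tiny model by brute-force enumeration. The sketch below assumes a binary RBM with the standard energy $E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h}$; the names `W`, `b`, `c` and the function names are illustrative, and the mean-field term is not computed.

```python
import itertools
import math
import random


def rbm_energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v^T W h for a binary RBM (assumed parameterization)."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(cj * hj for cj, hj in zip(c, h))
    pair_term = sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + pair_term)


def free_energy_bounds(v, W, b, c):
    """Return (lower bound, exact free energy, hard-min free energy) for one v."""
    energies = sorted(rbm_energy(v, h, W, b, c)
                      for h in itertools.product((0, 1), repeat=len(c)))
    f_hard = energies[0]  # min over hidden configurations of E(v, H)
    f_exact = -math.log(sum(math.exp(-e) for e in energies))
    # E_res: free energy over all hidden configurations except the minimizer.
    e_res = -math.log(sum(math.exp(-e) for e in energies[1:]))
    lower = f_hard - math.exp(f_hard - e_res)
    return lower, f_exact, f_hard
```

For any visible vector, the returned triple should be non-decreasing from left to right, which is the chain `lower <= F <= F_hard` with the mean-field term omitted.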

B.1 On The Number of Effective Mixtures of an RBM

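Theorem 5. The maximal number of effective mixtures of an RBM is $\sum_{j=0}^{N_0} \binom{N_1}{j}$.

Proof. The hard-min free energy of an RBM can be written as $\hat{F}(\mathbf{v}) = -\sum_{j=1}^{N_0} b_j v_j - \sum_{i=1}^{N_1} \max(0, \sum_{j=1}^{N_0} w_{ij} v_j + c_i)$. The number of linear regions of this function is the number of regions separated by the hyperplanes, each of which satisfies $\sum_{j=1}^{N_0} w_{ij} v_j + c_i = 0$ for $0 < i \le N_1$. The number of these regions is $\sum_{j=0}^{N_0} \binom{N_1}{j}$. This proves the claim. □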
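
To get a feel for how this count compares with the trivial bound $2^{N_1}$, the sum $\sum_{j=0}^{N_0} \binom{N_1}{j}$ (the standard region count for $N_1$ hyperplanes in general position in $N_0$ dimensions) can be evaluated directly. The function name below is illustrative.

```python
from math import comb


def rbm_max_effective_mixtures(n_visible, n_hidden):
    """Sum_{j=0}^{N_0} C(N_1, j): regions cut by N_1 hyperplanes in N_0 dimensions."""
    return sum(comb(n_hidden, j) for j in range(n_visible + 1))
```

With 2 visible and 3 hidden units the count is 7, one short of $2^3 = 8$; the two numbers coincide only once the number of visible units is at least the number of hidden units.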

B.2 On The Number of Effective Mixtures of a DBM 

Here we provide lower and upper bounds for the maximal number of effective mixtures of DBMs 
with respect to the parameters. Let us begin with a lower bound. Because DBMs are a superset 
of RBMs, the number of effective mixtures of a DBM can be as large as the maximal number of 
effective mixtures of RBMs. This observation leads us to a lower bound: 

Proposition 6. The maximal number of effective mixtures of a DBM is lower bounded by $\sum_{j=0}^{N_0} \binom{N_1}{j}$.

We here outline the idea of the proof of Theorem 7. A key observation is that the marginal distribution over the visible units of a DBM is written as a summation: $p(\mathbf{v}; \theta) = \sum_{\mathbf{h}^{(1)}} p(\mathbf{v}|\mathbf{h}^{(1)})\, p(\mathbf{h}^{(1)}; \theta)$, where $p(\mathbf{h}^{(1)}; \theta)$ is the marginal distribution over $\mathbf{h}^{(1)}$. This indicates that the number of mixing components of a DBM is $2^{N_1}$; this bounds the maximal number of effective mixtures from above. These observations lead to a natural but somewhat shocking result, in which the bound depends only on the number of units in the first hidden layer:

Theorem 7. The number of effective mixtures of a DBM with any number of hidden layers is upper bounded by $2^{N_1}$.

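Proof. Suppose a set of linear functions $S(\mathbf{h}^{(1)}) = \{E(\{\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(L)}, \mathbf{v}\}) \mid \mathbf{h}^{(k)} \in \{0,1\}^{N_k} \text{ for } 2 \le k \le L\}$. Linear functions within this set, $f \in S(\mathbf{h}^{(1)})$, share an identical gradient with respect to $\mathbf{v}$, which is determined by $\mathbf{h}^{(1)}$ alone, and differ only in a constant term $C$ that depends on $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(L)}$ and the parameters. Therefore, $\min S(\mathbf{h}^{(1)})$ is a linear function $f_{\min}(\mathbf{v}; \mathbf{h}^{(1)})$ with that same gradient and the constant term $C_{\min} = \min_{\mathbf{h}^{(2)}, \ldots, \mathbf{h}^{(L)}} C$. The hard-min free energy of a DBM is $\hat{F}(\mathbf{v}) = \min_{\mathbf{h}^{(1)}} \min S(\mathbf{h}^{(1)}) = \min_{\mathbf{h}^{(1)}} f_{\min}(\mathbf{v}; \mathbf{h}^{(1)})$, and its maximal number of linear regions is bounded above by the number of configurations of $\mathbf{h}^{(1)}$, i.e., $2^{N_1}$. □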
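
The argument can be illustrated on a toy two-hidden-layer DBM by brute force. The sketch below assumes the energy $E(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}) = -\mathbf{b}_0^\top \mathbf{v} - \mathbf{v}^\top W_1 \mathbf{h}^{(1)} - \mathbf{c}_1^\top \mathbf{h}^{(1)} - \mathbf{h}^{(1)\top} W_2 \mathbf{h}^{(2)} - \mathbf{c}_2^\top \mathbf{h}^{(2)}$ with random parameters and real-valued probe inputs; all names and sizes are assumptions for the example. It counts how many distinct linear pieces of the hard-min free energy are hit by random inputs.

```python
import itertools
import random


def count_active_pieces(n0, n1, n2, n_samples=500, seed=0):
    """Count distinct linear pieces of min_{h1,h2} E(v, h1, h2) hit by random v."""
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) for _ in range(n1)] for _ in range(n0)]
    W2 = [[rng.gauss(0, 1) for _ in range(n2)] for _ in range(n1)]
    b0 = [rng.gauss(0, 1) for _ in range(n0)]
    c1 = [rng.gauss(0, 1) for _ in range(n1)]
    c2 = [rng.gauss(0, 1) for _ in range(n2)]

    def energy(v, h1, h2):
        e = -sum(b0[i] * v[i] for i in range(n0))
        e -= sum(v[i] * W1[i][j] * h1[j] for i in range(n0) for j in range(n1))
        e -= sum(c1[j] * h1[j] for j in range(n1))
        e -= sum(h1[j] * W2[j][k] * h2[k] for j in range(n1) for k in range(n2))
        e -= sum(c2[k] * h2[k] for k in range(n2))
        return e

    configs = [(h1, h2)
               for h1 in itertools.product((0, 1), repeat=n1)
               for h2 in itertools.product((0, 1), repeat=n2)]
    active_h1 = set()
    for _ in range(n_samples):
        v = [rng.uniform(-5, 5) for _ in range(n0)]
        h1, _h2 = min(configs, key=lambda hh: energy(v, hh[0], hh[1]))
        # The gradient of the active piece with respect to v depends on h1 only,
        # so distinct pieces correspond to distinct first-layer configurations.
        active_h1.add(h1)
    return len(active_h1)
```

Although the toy model with sizes (2, 2, 3) has $2^{2+3} = 32$ joint hidden configurations, the count returned can never exceed $2^{N_1} = 4$.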

These results reveal a serious limitation on the representation power of DBMs. There are two ways to increase the number of effective mixtures of a DBM. The first way is to stack layers. However, the number of effective mixtures never becomes greater than $2^{N_1}$, which is solely determined by $N_1$. Therefore, depth does not substantially increase the capacity of DBMs measured in the number of effective mixtures. The second way is to increase $N_1$. This strategy, however, at least necessitates the presence of second-layer units, which do not improve the bound $2^{N_1}$. Otherwise, the DBM is equivalent to an RBM, and its maximal number of effective mixtures is merely $O(N_1^{N_0})$. Therefore, the number of effective mixtures of a DBM is smaller than the upper bound $2^{N_{\mathrm{hid}}}$ of the earlier Proposition:

Proposition 8. The number of effective mixtures of a DBM with $N_1 > N_0$ never achieves the bound $2^{N_{\mathrm{hid}}}$.

Proof. First, suppose that the DBM has no hidden layers above the first layer. This DBM is equivalent to an RBM; thus, from Theorem 5, the maximal number of effective mixtures of this DBM is smaller than $2^{N_{\mathrm{hid}}}$. Next, suppose that the DBM has more than one hidden unit in its third hidden layer with non-zero connection weights between units in the second hidden layer. From Theorem 7, the number of effective mixtures of this DBM is bounded above by $2^{N_1} < 2^{N_{\mathrm{hid}}}$. This proves the claim. □

B.3 On The Number of Effective Mixtures of an sDBM 

Lemma 9. Assume that the parameters are computed with SOFTDEEP(M) for a large integer $M$. Then the elements of $S(L)$ for $0 \le L \le M$ are tangents of a quadratic function with equally spaced points of tangency.

Proof. We show the claim by induction with the quadratic function

$f(x^{(0)}) = -0.5\,(x^{(0)}(x^{(0)} + 1) + 0.25)$. (18)

As in the main text, let $S(L)$ be the set of linear functions $\{E(x^{(0)}, x^{(1:L)}) \mid x^{(1:L)} \in \{0,1\}^L\}$. Assume that the elements of $S(L-1)$ are tangents of $f(x^{(0)})$, where the point of tangency is $x^{(0)} = \sum_{l=1}^{L-1} x^{(l)} 2^{l-1} - 0.5$ and the slope is $-\sum_{k=1}^{L-1} x^{(k)} 2^{k-1}$. We divide $S(L)$ into two sets, $S_{x^{(L)}=0}(L)$ and $S_{x^{(L)}=1}(L)$, each of which is the set of lines that correspond to either $x^{(L)} = 0$ or $x^{(L)} = 1$. We can readily show that the elements of $S_{x^{(L)}=0}(L)$ are tangents of $f(x^{(0)})$ because $S(L-1) = S_{x^{(L)}=0}(L)$. We can show the tangency of the elements of $S_{x^{(L)}=1}(L)$ as follows. Let $g_{x^{(L)}=\eta}(x^{(0)}; x^{(1:L-1)})$ be an element of $S_{x^{(L)}=\eta}(L)$ with hidden configuration $x^{(1:L-1)}$ for $\eta \in \{0,1\}$, i.e.,



(19) 

Let us consider the difference g^(L)-i(x'^^'^',x^'^''^~^'>) — g^{L)= 
C^{i-.L-i) where and can be computed as follows: 


TJ (L,0) cyL — 1 

(20) 

and 



(21) 

0<1<L 


= -u,(L,o) a:(')2'-i+0.5((u;(^’°))2-u;(^'°)) 

(22) 

0<1<L 


= - 0.5^ 

(23) 

= 0.5(i?3,(l;I,-l) , 

(24) 

where we used 


From Eqs. (20) and (24), $g_{x^{(L)}=1}(x^{(0)}; x^{(1:L-1)})$ is a tangent of $f(x)$ because the difference between the y-intercepts $C_\alpha$ and $C_{\alpha+\beta}$ of two tangents of a quadratic function $ax^2 + bx + c$ with slopes $\alpha$ and $\alpha + \beta$ is calculated as $C_{\alpha+\beta} - C_\alpha = -\frac{\beta^2}{4a} - \beta x_\alpha$, where $x_\alpha$ is the point of tangency of the line with slope $\alpha$. The point of tangency of $g_{x^{(L)}=1}(x^{(0)}; x^{(1:L-1)})$ is $\sum_{l=1}^{L-1} x^{(l)} 2^{l-1} + 2^{L-1} - 0.5$ and the slope is $-\sum_{k=1}^{L-1} x^{(k)} 2^{k-1} - 2^{L-1} = -\sum_{k=1}^{L} x^{(k)} 2^{k-1}$. Therefore, the elements of $S(L)$ are tangents of $f(x^{(0)})$ if the elements of $S(L-1)$ are tangents of $f(x^{(0)})$.

Observe that $S(0)$ contains only one element $g(x^{(0)}) = 0$; this is a tangent of $f(x^{(0)})$ at the point of tangency $x^{(0)} = -0.5$. Therefore, the elements of $S(L)$ are tangents of $f(x^{(0)})$ for any $L \le M$. This proves the claim.

□
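
The y-intercept identity for tangents of a quadratic that the proof relies on can be verified in a few lines. The following worked derivation is a sketch; the symbol $x_{\alpha+\beta}$ (the point of tangency of the line with slope $\alpha + \beta$) and the shorthand $n$ are introduced here only for illustration.

```latex
\begin{align*}
2a x_\alpha + b &= \alpha, \qquad
C_\alpha = (a x_\alpha^2 + b x_\alpha + c) - \alpha x_\alpha = c - a x_\alpha^2, \\
x_{\alpha+\beta} &= x_\alpha + \frac{\beta}{2a}, \\
C_{\alpha+\beta} - C_\alpha
  &= -a\left(x_\alpha + \frac{\beta}{2a}\right)^2 + a x_\alpha^2
   = -\frac{\beta^2}{4a} - \beta x_\alpha . \\
\intertext{For the quadratic with $a = -\tfrac{1}{2}$, a slope change $\beta = -2^{L-1}$,
and $x_\alpha = n - \tfrac{1}{2}$ with $n = \sum_{l=1}^{L-1} x^{(l)} 2^{l-1}$:}
C_{\alpha+\beta} - C_\alpha &= 2^{2L-3} + 2^{L-1}\left(n - \tfrac{1}{2}\right).
\end{align*}
```

The new point of tangency is then $x_\alpha + \beta/(2a) = n + 2^{L-1} - \tfrac{1}{2}$, which agrees with the equally spaced points of tangency claimed in the lemma.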

Lemma 10. The number of effective mixtures of a gBM(L) reaches $2^L = 2^{N_{\mathrm{hid}}}$, the bound in the earlier Proposition, when the parameters are properly set.

Proof. Assume a gBM(L) whose parameters are generated with SOFTDEEP(L). From Lemma 9, each element of $S(L)$ is a tangent of $f(x^{(0)}) = -0.5\,(x^{(0)}(x^{(0)} + 1) + 0.25)$ at a different point. Because $f$ is strictly concave, $g(x^{(0)}) \ge f(x^{(0)})$ for $g \in S(L)$, and the equality holds at the point of tangency. Therefore, for $g, g' \in S(L)$ ($g \ne g'$), $g(x) = f(x) < g'(x)$, where $x$ is the point of tangency of $g$. Thus, in a neighborhood of $x$, $\hat{F}(x^{(0)}) = g(x^{(0)})$. Because the elements of $S(L)$ are tangents of $f(x^{(0)})$ at $2^L$ different points, the number of effective mixtures of the gBM(L) is $2^L$. This proves the claim. □
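
The counting argument can be reproduced numerically. The sketch below does not use the SOFTDEEP parameters themselves; as an assumption, it builds the tangent lines directly from the points of tangency $n - 0.5$ stated in Lemma 9 (with $n$ the integer encoded by the hidden bits) and counts how many of them attain the lower envelope $\min_g g(x)$ somewhere on a grid. Function names are illustrative.

```python
import itertools


def f(x):
    """The quadratic f(x) = -0.5 * (x * (x + 1) + 0.25)."""
    return -0.5 * (x * (x + 1) + 0.25)


def tangent_lines(num_hidden):
    """(slope, intercept) of the tangent of f for every hidden configuration."""
    lines = []
    for bits in itertools.product((0, 1), repeat=num_hidden):
        n = sum(bit * 2 ** k for k, bit in enumerate(bits))
        x_t = n - 0.5          # point of tangency
        slope = -(x_t + 0.5)   # f'(x) = -(x + 0.5)
        lines.append((slope, f(x_t) - slope * x_t))
    return lines


def count_effective_lines(num_hidden, steps_per_unit=4):
    """Number of distinct lines attaining min_g g(x) at some grid point."""
    lines = tangent_lines(num_hidden)
    active = set()
    for i in range(2 ** num_hidden * steps_per_unit + 1):
        x = -1.0 + i / steps_per_unit  # grid covers every point of tangency
        values = [slope * x + intercept for slope, intercept in lines]
        active.add(values.index(min(values)))
    return len(active)
```

Every tangent is the unique minimizer at its own point of tangency, so the count equals the number of hidden configurations, $2^L$.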

This result directly indicates that general BMs with more than one visible unit can also achieve the maximal number of effective mixtures $2^{N_{\mathrm{hid}}}$, with connection weights determined by our construction procedure, where the visible-hidden connections are replicated for all the visible units.



Figure 7: Illustration of an sDBM, which is used in the proof of Theorem 11.

Theorem 11. Suppose an sDBM with $L$ hidden layers, each of which contains $M$ ($\le N_{\mathrm{vis}}$) units. Then the number of effective mixtures of this sDBM reaches $2^{LM} = 2^{N_{\mathrm{hid}}}$, the bound in the earlier Proposition, with a certain parameter configuration.

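Proof. Assume an sDBM constructed as a collection of $M$ independent gBM(L)s, each of which has parameters generated with SOFTDEEP(L). Then $\hat{F}(\mathbf{v}; \theta) = \sum_{m=1}^{M} \hat{F}_m(v_m; \theta_m)$. Because each $\hat{F}_m(v_m; \theta_m)$ has $2^L$ linear regions, the number of linear regions of $\hat{F}(\mathbf{v}; \theta)$ is $2^{LM}$. This proves the claim. □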

C Connection to Biological Neural Nets 

There has been increasingly intense research interest in the connection between deep neural networks and biological neural networks. One prevalent view is that the layers of deep neural networks correspond to cortical regions that form a hierarchy. However, unlike conventional deep networks, biological neural networks are widely known to have many connections that bypass between functionally remote cortical regions (e.g., between V1 and MT). Because bypassing connections do not contribute much to the representation power of feedforward neural networks, the recent great success of deep feedforward networks does not explain the functional role of such bypassing connections in our brain. Our results on sDBMs may help us to understand this mystery.

Figure 8: $\hat{F}$ of gBM(3) displayed in a large size. $\hat{F}$ is printed in black, and $E(v, \cdot) \in S(L)$ are in gray.


D Details of Experiments 

D.1 Parameters

We tuned hyper parameters via random sampling; initial learning rates were sampled from 10“ 
for MNIST and from 10“^'^’^ ®! for Caltech-101 silhouettes, strengths of L2 regularization were 
sampled from 10“1^’^1, rj was sampled from [0.5, 3.5], and update constants for the centering pa¬ 
rameters were sampled from 10“[®’®1. We generated 16 configurations of hyper parameters for each 
experiment setting. The number of parameter updates was 10®. 

Networks were trained with stochastic maximum likelihood. We did not perform variational inference. The number of positive-phase Markov chain updates per parameter update was 5 for MNIST and 1 for Caltech-101 silhouettes. The number of negative-phase Markov chain updates per parameter update was 5. The batch size was set to 100.
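
The random search over hyperparameters described in this subsection can be sketched as below. This is an illustration under stated assumptions, not the authors' script: "sampled from $10^{-[a, b]}$" is read as drawing the exponent uniformly from $[a, b]$, the exponent ranges are left as caller-supplied arguments, and the dictionary keys (including `eta` for the quantity drawn from [0.5, 3.5]) are hypothetical names.

```python
import random


def sample_log_uniform(rng, exp_low, exp_high):
    """Draw 10**(-e) with e uniform in [exp_low, exp_high] (assumed reading)."""
    return 10.0 ** (-rng.uniform(exp_low, exp_high))


def sample_configs(rng, lr_exp, l2_exp, center_exp,
                   eta_range=(0.5, 3.5), n_configs=16):
    """Generate n_configs random hyperparameter settings.

    lr_exp, l2_exp and center_exp are (low, high) exponent ranges that the
    caller must supply; no default values are assumed for them here.
    """
    configs = []
    for _ in range(n_configs):
        configs.append({
            "learning_rate": sample_log_uniform(rng, *lr_exp),
            "l2_strength": sample_log_uniform(rng, *l2_exp),
            "eta": rng.uniform(*eta_range),
            "centering_rate": sample_log_uniform(rng, *center_exp),
        })
    return configs
```

Sampling the exponent rather than the value itself spreads the draws evenly across orders of magnitude, which is the usual motivation for this kind of search.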

D.2 AIS 

Throughout training, we monitored the training and test log-likelihood of the models by occasionally performing AIS. This monitoring AIS was executed with rather cheap settings of 100 runs and 30,000 intermediate distributions. After training, we performed a more expensive AIS on several of the best-performing models, as evaluated via the cheap AIS, to obtain thorough estimates. This expensive AIS was executed with at least 1,000 runs and at least 300,000 intermediate distributions. All the figures reported in the main text were obtained with this expensive AIS.

Table 3: Details of AIS estimates for sDBMs trained on MNIST. Estimated variational lower bounds on training and test data are reported. 3σ confidence intervals are also reported in parentheses.

Model      Train FF                  Test FF
sDBM 2hl   -68.80 (-68.91, -68.67)   -76.41 (-76.53, -76.28)
sDBM 3hl   -71.17 (-71.58, -70.48)   -74.58 (-74.98, -73.89)
sDBM 4hl   -61.90 (-62.36, -61.04)   -66.56 (-67.01, -65.70)

Table 4: Details of AIS estimates for sDBMs trained on Caltech-101 silhouettes, as in Table 3.

Model      Train FF                  Test FF
sDBM 2hl   -30.16 (-30.35, -29.92)   -92.37 (-92.56, -92.13)
sDBM 3hl   -72.62 (-72.69, -72.56)   -98.66 (-98.72, -98.59)
sDBM 4hl   -38.16 (-38.29, -38.02)   -85.55 (-85.67, -85.40)
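
For readers unfamiliar with AIS, the sketch below estimates the log partition function of a tiny binary RBM and compares it with brute-force enumeration. It is an illustration under assumptions, not the evaluation code used for the reported numbers: it uses an RBM rather than an sDBM, a uniform base distribution, a linear schedule of inverse temperatures that scales all parameters, and far fewer runs and intermediate distributions than the settings quoted above. All function and variable names are hypothetical.

```python
import itertools
import math
import random


def log1pexp(x):
    """Stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_input(v, j, W, c):
    return c[j] + sum(v[i] * W[i][j] for i in range(len(v)))


def log_p_star(v, beta, W, b, c):
    """Log unnormalized marginal of v when all parameters are scaled by beta."""
    total = beta * sum(bi * vi for bi, vi in zip(b, v))
    return total + sum(log1pexp(beta * hidden_input(v, j, W, c))
                       for j in range(len(c)))


def gibbs_step(v, beta, W, b, c, rng):
    """One Gibbs sweep that leaves the beta-scaled RBM invariant."""
    h = [1 if rng.random() < sigmoid(beta * hidden_input(v, j, W, c)) else 0
         for j in range(len(c))]
    new_v = []
    for i in range(len(b)):
        a = b[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        new_v.append(1 if rng.random() < sigmoid(beta * a) else 0)
    return new_v


def ais_log_z(W, b, c, n_runs=100, n_betas=200, seed=0):
    """AIS estimate of log Z for the RBM at beta = 1."""
    rng = random.Random(seed)
    betas = [k / n_betas for k in range(n_betas + 1)]
    log_weights = []
    for _ in range(n_runs):
        v = [rng.randint(0, 1) for _ in range(len(b))]  # exact sample at beta = 0
        log_w = 0.0
        for prev, cur in zip(betas[:-1], betas[1:]):
            log_w += log_p_star(v, cur, W, b, c) - log_p_star(v, prev, W, b, c)
            v = gibbs_step(v, cur, W, b, c, rng)
        log_weights.append(log_w)
    top = max(log_weights)
    log_mean = top + math.log(sum(math.exp(w - top) for w in log_weights) / n_runs)
    # At beta = 0 the model is uniform, so log Z_0 = (n_vis + n_hid) * log 2.
    return (len(b) + len(c)) * math.log(2.0) + log_mean


def exact_log_z(W, b, c):
    """Brute-force log partition function, feasible only for tiny models."""
    logs = [log_p_star(v, 1.0, W, b, c)
            for v in itertools.product((0, 1), repeat=len(b))]
    top = max(logs)
    return top + math.log(sum(math.exp(x - top) for x in logs))
```

On a model small enough to enumerate, `ais_log_z` and `exact_log_z` should agree closely; increasing `n_runs` and `n_betas` trades computation for a tighter estimate, which mirrors the cheap versus expensive settings described in this subsection.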


D.3 Samples from sDBMs 

Figures 9 and 10 show consecutive samples from the best-performing 4-layered sDBMs. These figures demonstrate good mixing of the Markov chains between several classes.


Figure 9: Consecutive samples generated from a 4-layered sDBM trained on MNIST.

Figure 10: Consecutive samples generated from a 4-layered sDBM trained on Caltech-101 silhouettes.